Genome assembly of the deep-sea coral Lophelia pertusa

Like their shallow-water counterparts, cold-water corals create reefs that support highly diverse communities, and these structures are subject to numerous anthropogenic threats. Here, we present the genome assembly of Lophelia pertusa from the southeastern coast of the USA, the first one for a deep-sea scleractinian coral species. We generated PacBio continuous long-read (CLR) data for an initial assembly and proximity ligation data for scaffolding. The assembly was annotated using evidence from transcripts, proteins, and ab initio gene model predictions. This assembly is comparable to high-quality reference genomes from shallow-water scleractinian corals. The assembly comprises 2,858 scaffolds (N50 1.6 Mbp) and has a size of 556.9 Mbp. Repetitive elements make up approximately 57% of the genome and coding DNA approximately 34%. We predicted 41,089 genes, including 91.1% of complete metazoan orthologs. This assembly will facilitate investigations into the ecology of this species and the evolution of deep-sea corals.

80 mbsl off the coast of Norway to over 1000 mbsl on the Mid-Atlantic Ridge. Although L. pertusa is arguably the best-studied deep-sea coral species, a high-quality reference genome assembly is still missing. This hinders our understanding of the biology of this coral species, its ecological functions, and its capacity to survive anthropogenic threats.
Here, we present the genome assembly of Lophelia pertusa, the first one for a deep-sea scleractinian coral species. Only one other genome-level DNA sequence dataset has been published for L. pertusa. Emblem and collaborators [13] produced 73 million SOLiD ligation sequencing reads and 1.2 million 454 pyrosequencing reads with average lengths of 46 bp and 580 bp, respectively. The Emblem dataset was useful for detecting mitochondrial single nucleotide polymorphisms, but its coverage was too low and its reads too fragmentary to produce a useful genome assembly. Our study used PacBio continuous long reads (CLR) data for the initial assembly, followed by proximity ligation data for scaffolding and RNA-seq data for annotation. Our approach yielded a genome assembly of comparable quality to those obtained from shallow-water scleractinian corals [14][15][16][17].

Sample collection
Branches of Lophelia pertusa were obtained from the Savannah Banks site, off the southeastern coast of the continental USA, Atlantic Ocean (latitude 31.75420, longitude −79.19442, depth 515 mbsl), while aboard the NOAA Ship Ronald Brown (expedition RB1903) using ROV Jason (Dive 1130) on April 17, 2019 (BioSample accession SAMN31822850). The branches were collected using a hydraulic robotic arm and stored in an insulated biobox until they reached the surface (Figures 1c,d). Once onboard the ship, they were immersed in cold RNAlater (Thermo Fisher), left to soak in the refrigerator (4°C) for 24 hours, and then frozen at −80°C. Samples remained at that temperature until DNA was purified in the laboratory.

DNA purification
Polyp tissue was scraped from the skeleton and digested in 2% cetyltrimethylammonium bromide (CTAB) buffer with 0.5% β-mercaptoethanol for 15 minutes at 68°C. The DNA was purified through two rounds of phenol:chloroform:isoamyl alcohol (25:24:1) and one round of chloroform:isoamyl alcohol (24:1) extraction, each consisting of mixing and phase partitioning by centrifugation at 10,000 rpm for 10 minutes. The DNA was precipitated out of the solution with 100% isopropanol. The resulting pellet was washed with 70% ethanol, then air-dried and resuspended in Qiagen G2 buffer. DNA concentration was quantified using a Qubit fluorometer (Invitrogen). The DNA was further purified using the Blood & Cell Culture DNA Midi Kit (Qiagen kit #13343) following the manufacturer's protocol after one hour of protease digestion. The average DNA fragment size was determined using pulsed-field gel electrophoresis (PFGE).

DNA sequencing
A total of 19.3 Gbp contained in 2.07 million CLR were generated using a PacBio Sequel sequencer. For this, a 20 kb PacBio SMRTbell library was constructed using BluePippin size selection. Long-insert chromosome conformation capture Chicago [18] and Hi-C [19] libraries (one each) were constructed and sequenced on an Illumina HiSeq X sequencer (paired-end, 150 bp), yielding 46.7 Gbp (156 million pairs) for the Chicago library and 72.6 Gbp (242 million pairs) for the Hi-C library.
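The reported yields can be cross-checked with simple arithmetic. The sketch below is illustrative only: it recomputes the Illumina yields from the stated pair counts and read length, and estimates the mean CLR length and long-read depth. Using the final assembly size (556.9 Mbp) as a stand-in for the true genome size is an assumption made here, and the function names are hypothetical.

```python
# Cross-check of the sequencing yields reported in the text.
READ_LENGTH_BP = 150          # paired-end read length stated in the text
ASSEMBLY_SIZE_BP = 556.9e6    # assumption: assembly size used as a proxy for genome size


def paired_end_yield_gbp(n_pairs, read_length_bp=READ_LENGTH_BP):
    """Total bases (in Gbp) produced by n_pairs read pairs of the given length."""
    return n_pairs * 2 * read_length_bp / 1e9


def depth(total_bp, genome_size_bp=ASSEMBLY_SIZE_BP):
    """Average sequencing depth for a given number of sequenced bases."""
    return total_bp / genome_size_bp


chicago_gbp = paired_end_yield_gbp(156e6)   # reported as 46.7 Gbp
hic_gbp = paired_end_yield_gbp(242e6)       # reported as 72.6 Gbp
clr_depth = depth(19.3e9)                   # long-read depth over the assembly
mean_clr_length_bp = 19.3e9 / 2.07e6        # mean length of the 2.07 million CLR

print(f"Chicago yield: {chicago_gbp:.1f} Gbp")
print(f"Hi-C yield:    {hic_gbp:.1f} Gbp")
print(f"CLR depth:     {clr_depth:.1f}x")
print(f"Mean CLR:      {mean_clr_length_bp:.0f} bp")
```

The small difference between the recomputed Chicago yield and the reported 46.7 Gbp is consistent with the pair count having been rounded to the nearest million.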
Assemblies generated with other programs were not included because they had lower assembly contiguity or completeness (see the 'Data validation and quality control' section, Table 1).
[Figure legend: data inputs are indicated in maroon font; software packages are highlighted with a blue background.]

Scaffolding
Assembly E was scaffolded with long-insert Chicago and Hi-C reads following the Arima Genomics mapping pipeline A160156 v02 (retrieved from https://github.com/ArimaGenomics/mapping_pipeline). First, the reads from the Chicago library were aligned to assembly E using the MEM algorithm of the program BWA (v0.7.17; RRID:SCR_010910) [27]. The Chicago and Hi-C sequence data had mapping rates to the assembly of 96% and 98%, respectively, indicating high quality. Chimeric reads that mapped in the 3′ direction were excluded using the filter_five_end.pl script of the Arima-HiC mapping pipeline [28]. Reads were combined into pairs with the two_read_bam_combiner.pl script and sorted using Samtools (v1.10; RRID:SCR_002105) [29]. The program Picard tools (v2.26.6) [...] SALSA2 [31,32] (-e GATC -m yes) was used for scaffolding assembly E with the mapped Chicago reads (assembly F). The Hi-C reads were mapped to assembly H using the same procedure described above and re-scaffolded with SALSA2 (assembly H).
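The order of the mapping and scaffolding steps can be sketched as a list of shell commands. This is a rough illustration under stated assumptions, not the authors' exact commands: the file names, thread count, and mapping-quality cutoff are hypothetical, the Picard step is omitted, and the conversion to a read-name-sorted BED file follows the usual SALSA2 input convention rather than anything stated in the text. Nothing is executed; the function only builds the command strings.

```python
def scaffolding_steps(assembly, r1, r2, prefix, threads=8, mapq=10):
    """Return the shell commands, in order, as strings (nothing is executed).

    All arguments are hypothetical placeholders; mapq=10 is an assumed
    mapping-quality cutoff, not a value given in the text.
    """
    steps = [f"bwa index {assembly}", f"samtools faidx {assembly}"]
    for mate, fastq in (("1", r1), ("2", r2)):
        # Each mate is aligned separately with BWA-MEM
        steps.append(
            f"bwa mem -t {threads} {assembly} {fastq} "
            f"| samtools view -b -o {prefix}_{mate}.bam -"
        )
        # Exclude chimeric alignments that map in the 3' direction
        steps.append(
            f"samtools view -h {prefix}_{mate}.bam | perl filter_five_end.pl "
            f"| samtools view -b -o {prefix}_{mate}.filt.bam -"
        )
    # Re-pair the two filtered mates and sort the result
    steps.append(
        f"perl two_read_bam_combiner.pl {prefix}_1.filt.bam {prefix}_2.filt.bam "
        f"samtools {mapq} | samtools view -b - "
        f"| samtools sort -@ {threads} -o {prefix}.bam -"
    )
    # SALSA2 takes alignments as a BED file sorted by read name (assumed step)
    steps.append(f"bamToBed -i {prefix}.bam > {prefix}.bed")
    steps.append(f"sort -k 4 {prefix}.bed > {prefix}.sorted.bed")
    # Scaffolding with the options given in the text (-e GATC -m yes)
    steps.append(
        f"python run_pipeline.py -a {assembly} -l {assembly}.fai "
        f"-b {prefix}.sorted.bed -e GATC -m yes -o {prefix}_salsa"
    )
    return steps


chicago_steps = scaffolding_steps(
    "assembly_E.fasta", "chicago_R1.fastq.gz", "chicago_R2.fastq.gz", "chicago"
)
for command in chicago_steps:
    print(command)
```

Calling the function a second time with the Hi-C read files and the Chicago-scaffolded assembly would give the corresponding commands for the re-scaffolding round.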
The protein products of the predicted coding gene models were functionally annotated using the Funannotate annotate script. The following annotations were added: (1)

Quality control
The quality of each assembly was assessed using Quast (v5.0.2; RRID:SCR_001228) [62] and BUSCO v5 [60] (genome analysis with the metazoan lineage orthologs dataset OrthoDB v10 [59]). The steps described in the de novo assembly and scaffolding pipelines were implemented to maximize the contiguity of the assembly, measured by the N50 statistic, and its completeness, measured by the percentage of single-copy metazoan orthologs present. The final assembly, assembly I, had an N50 of 1.61 Mbp, 5 to 10 times greater than the N50 of the initial de novo assemblies without merging or scaffolding (assemblies A, B, and C). Similarly, assembly I contained complete copies of 89% of the 954 single-copy metazoan orthologs surveyed, which was between 7% and 18% more than the initial de novo assemblies. Quality metrics for the final assembly (I) are shown in Table 1.
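For readers unfamiliar with the contiguity metric, the following minimal sketch shows how N50 is computed from a list of sequence lengths. It uses the standard definition and a toy set of lengths; it is not the Quast implementation, and the function name is hypothetical. The last lines convert the reported 89% completeness into an approximate ortholog count, which is only approximate because the percentage is rounded.

```python
def n50(lengths):
    """Return the N50 of a set of sequence lengths.

    N50 is the length L such that sequences of length >= L together
    contain at least half of the total assembled bases.
    """
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Toy example: total length 300, so the half-way point is 150 bases.
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_lengths)
print(toy_n50)  # 80 + 70 reaches 150, so the N50 is 70

# Approximate number of complete orthologs implied by 89% of 954.
approx_complete = round(0.89 * 954)
print(approx_complete)
```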
The quality of genome assembly I is comparable to those obtained from shallow-water scleractinian corals. For comparison, we retrieved available genome assemblies of scleractinian corals with RefSeq (RRID:SCR_003496) annotations from the NCBI's Genome database. This genome set comprised assemblies for the species Orbicella faveolata [14], Stylophora pistillata [17], Pocillopora damicornis [15], and Acropora millepora [16]. We also retrieved the genome assembly of Porites lutea from reefgenomics.org. The quality of each of these assemblies was assessed using Quast and BUSCO as described above. The Lophelia pertusa assembly I has greater contiguity (N50) than most of the other scleractinian [...] values were estimated through 500 rapid bootstrap replicates. The resulting tree topology is congruent with the most recent phylogeny for the group [65].

Re-use potential
The assembly of the Lophelia pertusa genome will facilitate numerous investigations into the ecology and evolution of this important species. This reference resource will enable population-genomic studies of this species within the US exclusive economic zone and comparative studies with populations throughout the Atlantic Ocean, Gulf of Mexico, Caribbean Sea, and Mediterranean Sea. This genome assembly will also be instrumental in resolving the taxonomic position of Lophelia pertusa as a monotypic genus instead of its proposed placement as a species, or set of species, within the genus Desmophyllum. This annotated genome assembly is the first one for a deep-sea scleractinian coral and thus will provide insights into the evolutionary history of deep-sea corals and the genomic adaptations to the deep-sea environment.

DATA AVAILABILITY
The sequencing data and metadata supporting the results of this article are available at the US National Library of Medicine, on NCBI under the BioProject accession PRJNA903949, BioSample accession SAMN31822850, WGS accession JAPMOT000000000, and SRA accessions SRR22387542 (Hi-C reads), SRR22387543 (Chicago reads), and SRR22387544 (PacBio reads). The RNA-seq data are available under the BioProject accession PRJNA922177.
A voucher of the Lophelia pertusa specimen sequenced in this study is available at the Smithsonian Institution National Museum of Natural History under the accession number USNM 1676648. The data are also available in the GigaDB repository [66].

List of abbreviations
CLR, continuous long reads; mbsl, meters below sea level; GO, gene ontology; NCBI, National Center for Biotechnology Information; nt, nucleotide collection database; ROV, remotely operated vehicle; WGS, Whole Genome Shotgun.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Approved for publication by the Bureau of Ocean Energy Management.

Competing Interests
The authors declare that they have no competing interests.

Authors' contributions
SH and EEC conceptualized the project. EEC acquired and managed the funding, collected the samples, and provided computational resources. SH and EEC generated the data. SH curated the data, performed analyses, generated visualizations, and wrote the original draft.
SH and EEC reviewed and edited the manuscript.